-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
patch: group by n_unique
#917
Conversation
narwhals/_arrow/group_by.py
Outdated
"std": "stddev", | ||
"var": "variance", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found these other two cases while reading pyarrow grouped table aggregations (https://arrow.apache.org/docs/python/compute.html#py-grouped-aggrs)
narwhals/_dask/group_by.py
Outdated
def n_unique() -> dd.Aggregation: | ||
import dask.dataframe as dd # ignore-banned-import | ||
|
||
return dd.Aggregation( | ||
name="nunique", | ||
chunk=lambda s: s.apply(lambda x: list(set(x))), | ||
agg=lambda s0: s0.obj.groupby(level=list(range(s0.obj.index.nlevels))).sum(), | ||
finalize=lambda s1: s1.apply(lambda final: len(set(final))), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comes from the dask documentation itself.
For some reason, nunique
kw only works in
df_dd.groupby("a").b.nunique()
but not in agg
context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for digging this up in the dask docs!
I think this is way more complex than it needs to be - i've pushed something much simpler which seems to work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @FBruzzesi !
the main issue missing was the handling of null values, but i've added some commits to take care of that
What type of PR is this? (check all applicable)
Related issues
groupby-agg
raises #915Checklist
If you have comments or can explain your changes, please do so below.
Attempt for #915, comments in the code